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Abstract 

The problems that scientists face in creating well designed databases 
, intersect with the concerns of data curation. Entity-relationship modeling 



and its variants have been the basis of most relational data modeling for 
decades. However, these abstractions and the relational model itself are 
. intricate and have proved not to be very accessible among scientists with 

limited resources for data management. This paper explores one aspect 
of relational data models, the meaning of foreign key relationships. We 
CO ' observe that a foreign key produces a table relationship that generally 

^ ' references either an entity or repeating attributes. This paper proposes 

constructing foreign keys based on these two cases, and suggests that the 
method promotes intuitive data modeling and normalization. 

in 

On ■ 1 Introduction 

O , 

Science has become dependent on the ability to share and reuse data sets that 
can be very large in scale and quite heterogeneous pQ. In scientific communities 
and in digital library communities, which are concerned with dissemination and 
preservation of scientific and scholarly output, there has been a growing interest 
. in many aspects of data management. In this context the term, "data curation," 

includes the preparation of data in order to make them reusable by others. 

Data curation begins ideally when data are first produced rather than as an 
afterthought, and it implies an improvement in data integrity and, therefore, in 
the quality of data models. One of the major contributions of database research 
has been the theory of normalization [3 , which is essential to the creation of 
robust data models. Normalization is a data modeling activity often performed 
by database specialists, but many database users find that the theory is not 
easy to apply. 

Since the 1970s there has also been a lot of work on layering conceptual and 
semantic abstractions over relational databases, with extensions or mappings 
to the relational model or with data modeling techniques. Entity-relationship 
modeling [2 is the most frequently used conceptual modeling technique [4] and 
has given rise to many variations. These approaches allow users to systematize 
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their knowledge about data at a conceptual level but often involve significant 
complexity [5]. There is also a natural tension between abstract conceptual mod- 
els, which obscure the underlying database, and lower- level "logical" models, 
which require a fairly sophisticated knowledge of relational modeling. 

Many scientists avoid these difficulties by using spreadsheets or text files 
instead of databases. The files are often extended ad hoc to encode additional 
structure within a table. This leads to a variety of table-based and hierarchical 
data models, many of which proliferate unnormalized data. It seems unlikely 
that small research labs will adopt the complicated data modeling processes that 
have evolved in the database and business communities. It is possible to offer 
them services that can help with database design; however, because scientific 
research tends to evolve and branch out continually, data modeling in these 
domains often continues as an iterative process throughout the life of a research 
program. There appears to be a need for simpler methods and tools that would 
allow scientists to integrate data modeling into their research process. 

While the importance of data curation is becoming recognized, there is a 
recent movement away from relational databases and normalized data models 
in order to cope with massive data scaling requirements of many current ap- 
plications. Relational databases have always been far from ubiquitous in the 
sciences; nevertheless this most recent wave of heterogeneous database models 
is a major shift in the landscape. It suggests that databases will need to inter- 
operate with each other in various hybrid relational/non-relational information 
systems, at the same time as they have to interface with a much broader range 
of users. 

In this paper a simple data modeling abstraction is proposed in the hope of 
making database design, and by extension data curation, more accessible to the 
increasing number of users that have to manage data. 

2 Definitions 

A table design is the set of attributes {ai, . . . , a n } represented in a database ta- 
ble. In this paper we will often use "table" as a shorthand to mean table design. 
It will be assumed that every table has a primary key, which is underlined in 
these examples. We start with a traditional employee example: 

E = {erap, ediv, edept} 
D = {div, dept, addr} 

Table E represents the employees that work in a company, with each tuple hav- 
ing the name of an employee (emp) and the name of the division (ediv) and 
department (edept) that the employee works in. Table D represents the depart- 
ments in the company, with each tuple having the name of a department (dept) 
and the division it is contained in (div), as well as the department's address 
(addr). Since a modern company might be the result of several mergers, a de- 
partment name may not be unique within the company; therefore the division 
and department names are used together to identify a department. 
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This example will also have a foreign key constraint empdept from {ediv, 
edept} in E to {div, dept} in D, which will be notated as follows. Let S and T 
be tables, and ACS. Let Kt be a candidate key of T. A foreign key constraint 
F from A (in S) to K T (in T) is written as F = </)s,t(A, K t ). 

In our case we have: 

empdept = (j)E,D({ediv, edept}, {div, dept}) 

Most of the foreign key constraints described in this paper reference primary 
keys, and it will be convenient to give them a separate definition. 

Definition 1. Let S and T be tables, with ACS. Let II t be the primary key of 
T. If there exists a foreign key constraint F = (()s,t(A, IIt), then F = $>s,t(A) 
is a simple foreign key constraint. 

Tables E and D are examples of a table which contains no foreign keys as 
part of its primary key. 

Definition 2. Let S be a table with primary key lis. S is an entity table if 
there exists no foreign key constraint <Ps,t(A,K) for any T, K, or A C lis. 

Suppose that each employee can have several phone numbers and that for this 
purpose there is defined a new table P and a foreign key constraint empphone: 

P = {emp, phone} 
empphone = ^p,E({emp}) 

P is another special class of table which has a foreign key as a proper subset of 
its primary key such that the foreign key references the primary key of another 
table. 

Definition 3. Let T be a table with primary key 11^. Then T is a multivalued 
table if there exists a simple foreign key constraint &t,s(A) for some S and 

a c n T . 

Definition 4. A table is a simple table if it is an entity table or a multivalued 
table. 

Two operators are defined for describing referential constraints involving 
simple tables. 

Definition 5. Let S be a simple table, and let T be an entity table. If there 
exists a simple foreign key constraint F = <&s,t(A) for some AcS, then there 
is an entity reference from S to T, e(F,T). 

Definition 6. Let T be a multivalued table with primary key IIt- Therefore 
there exists a simple foreign key constraint F = $t,s(A) for some S and A C 
Ht- If S is a simple table, then there is a multivalued reference from S to T, 
p(F,T). 
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In our example we will include the set of entity references and multivalued 
references from each table. To summarize: 



E = {emp, ediv, edept} 
Er = {s(empdept, D), \i(empphone, P)} 
D = {div, dept, addr} 

Dr = {} 

P — {emp, phone} 

Pr = {e(empphone, E)} 

empdept = $E t D({ediv, edept}) 

empphone = ^p,E({emp}) 

Note that a multivalued reference may have a corresponding entity reference 
arising out of the same constraint; for example, e (empphone, E). 

Definition 7. Let S be an entity table, let T be a multivalued table, and let 
r M = T) be a multivalued reference from S. Since this implies that there 
exists an entity reference from T, r £ = s(F,S), we call r £ a redundant entity 
reference corresponding to r M . 

Definition 8. If T is a table and R is the set of entity references and multivalued 
references from T, then we define an abstract table A(T) — (T', R r ) where T r C T 
and R' C R satisfy the following conditions: 

1. T' contains no foreign key attributes involved in an entity reference from 
T or multivalued reference to T0 

2. A multivalued reference to some table U can be included in R' only if 
there exists no other abstract table A(V) = (V , R' v ) where R' v contains 
a multivalued reference to U. 

3. For any A(W) = (W , R' w ) where W is an entity table, an entity reference 
r from T to W can be included in R' only if r is not a redundant entity 
reference corresponding to a multivalued reference in R' w . 

We will write A(T) = ({ai, • • , a n }^ • • ? ^n}) m a shorthand: 

A(T) : ai, . . . ,a n ,ri, . . . ,r n 

where a±, . . . , a n , n, . . . , r n may be arranged in any order. 
Our example can now be written as: 



A(E) 
A(D) 
A(P) 



emp, e (empdept, D), ^(empphone, P) 

div, dept, addr 

phone 



*If attributes excluded from T' by this condition are in the primary key of T, then we will 
underline any corresponding entity references in R' . (This does not apply to entity references 
excluded from R r by condition 3.) 
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If there exists a set of tables X = {Ti, . . . , T n } and corresponding A(Ti) = 
(T/,i?-) where Ti G X, we may consider a directed graph with each vertex v 
representing a table Ti and arcs directed from v representing entity references 
and multivalued references in R[. This graph may have directed cycles. The 
smallest possible directed cycle is formed by a single entity table containing 
a recursive entity reference, such as an employee having a manager that is 
also an employee, e.g. A(E) : emp, e(empmgr, E). As an example with two 
tables, if an employee can report to multiple managers, we might have A(E) : 
emp, ^(empmgr, M) and A(M) : e(mgr,E). 

3 Attributes 

In designing a new table, we will discuss attributes at a more abstract level 
than attributes in the relational model. We ask two questions about each new 
abstract attribute: 

1. Repeating: Is the attribute allowed to contain a Single (exactly one) value 
or Multiple (zero or more) values? [S or M] 

2. Defining: Is the attribute domain Nonentity (defined by the database 
system as a type or domain^) or Entity (defined by a table)? [N or E] 

The characteristics, repeating and defining, are independent of each other. S 
and M are mutually exclusive, and TV and E are mutually exclusive, in the sense 
that only one of each characteristic may be chosen. 

The choice between N and E involves a few considerations. An E attribute 
has a domain defined by the elements of a table. Each tuple of the defining 
table represents one element of the domain. Conceptually, an E attribute ref- 
erences an independent entity which can have its own properties via additional 
attributes. The entity can also be referenced by other tables. For example, in 
the previous section, employees are defined as working only in a department 
contained in the department table. In that case, the employee department is an 
E attribute, and the department address is a property of the department entity 
defined in the department table. Other tables in addition to the employee table 
could reference the department table. On the other hand, an N attribute is not 
constrained by the values of another table, and there should be no attributes 
that describe its properties. In the previous example, a department's address 
is probably an TV attribute because it is unconstrained (assuming there is no 
universal database of possible addresses) and does not have properties of its 
own defined by other attributes. It probably also will not need to be used as a 
domain by other tables. 

The choice of M includes cases where an attribute is optional (allowing 
either one value or no value). Selecting M for an optional attribute is preferred 
over using an S attribute with null values. Also, M is intended mainly for cases 



tFor example, CHARACTER or INTEGER in SQL. 
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Table 1: Number of database tables required 





Nonentity (N) 


Entity (E) 


Single (S) 





1 


Multiple (M) 


1 


2 



in which a variable number of values is possible. If the number of values is 
constant, then it may be preferable to create a fixed number of S attributes. 

When the abstract attributes are translated to the relational model, these 
characteristics determine the number of additional tables that will be needed 
(see Table 1 ) and the placement of foreign keys. 

4 Data model 

This section starts with a new example based on data collection in evolutionary 
biology research. Suppose that one wants to design a database to record individ- 
ual specimens of various organisms collected in the field, and for each individual 
someone has taken photographic images and recorded its organism type (such 
as genus and species). One way to begin would be to define an entity table 
representing the individuals. In order to allow for hybrids, we will store "bio- 
types" defined as combinations of organism types. Both biotypes and organism 
types will be modeled as entities, to allow for the possibility that one may want 
to describe them with additional attributes. We can express the "individual" 
entity / as, "An individual in / is uniquely identified by a name indname (a Sin- 
gle Nonentity attribute), and has photo images images (a Multiple Nonentity 
attribute) and a biotype biotype (a Single Entity attribute)." MN attributes 
are notated using the /i() operator, and SE attributes are notated with sQ: 

A(I) : indname , fidmaqes, G),e(biotyye, B) 

This implies two new tables, G and B. We proceed to defining the multi- 
valued table G: "For each individual in /, G contains a set of unique image 
file names imgfile (a Single Nonentity attribute), each having an associated 
comment notes (a Single Nonentity attribute)." This is notated as: 

A(G) : imgfile, notes 

The entity table B is defined as: "A biotype in B is uniquely identified 
by a name btname (a Single Nonentity attribute), and has organisms orgs (a 
Multiple Entity attribute)." ME attributes are notated by combining /i() and 
s(): 

A(B) : btname , fi(orqs, T) 
A(T) : s{org,Q) 

Next, the entity table O must be defined: "An organism in O is uniquely 
identified by a genus genus (a Single Nonentity attribute) and a species species 
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(a Single Nonentity attribute), and has a common name cname (a Single Nonen- 
tity attribute)." Therefore, including the entire example: 



A(I) 


indname, fidmaqes, G),e(biotype, B) 


A(G) 


imgfile, notes 


A(B) 


btname, fiiorqs, T) 


A(T) 


e(org, 0) 


A(0) 


genus, species, cname 



This would be implemented in a database as: 



I = 


{indname, btname} 


G = 


{indname, imgfile, notes} 


B = 


{btname} 


T = 


{btname, genus, species} 


= 


{genus, species, cname} 


images = 


&g,i ({indname}) 


biotype = 


® i, b ({btname}) 


orgs = 


<&t,b ({btname}) 


org = 


&T,o({genus, species}) 



The same relational schema structure would result if organisms were defined 
as participating in multiple biotypes: 

A(0) : genus, species, cname, u(biotypes, T) 
A(T) : e(biotype, B) 
A(B) : btname 
etc. 

Note that this schema is in 4NF, and redundancies have been avoided by 
removing multivalued attributes to new tables using /i(). Returning to the first 
example in this paper, it is natural to see that an employee's phone numbers and 
list of publications are unrelated, which might suggest A(E) : emp, u(empphone, 
P), ji(emppub,B) where B is a table representing an employee's publications. 
The less desirable A(E) : emp, phones, pubs (probably not in INF) and A(E) : 
emp, phone, pub (not in 4NF) are possible but not as intuitive. 

The use of entity references may help to clarify the role of attributes. The 
problems addressed by 2NF and 3NF can be avoided if attributes that need to 
be described by other attributes are removed to new entity tables using e(). 

This paper suggests that abstracting entity references and multivalued ref- 
erences, while otherwise remaining close to the relational model, can promote 
intuitive normalization. 
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5 Hierarchical application 



If there exists a set of tables X = {T\, . . . ,T n } and corresponding A(Ti) = 
(T/,i?-) where Ti G X, we may consider a directed graph with each vertex 
v representing a table T{ and arcs directed from v representing multivalued 
references in R[. This graph is a set of trees or hierarchies with entity tables 
as the root nodes. We rewrite an abstract table A(T) in this hierarchical form 
using a shorthand denoted by H(T), and we include any multivalued table as 
contained within its parent. If we indicate multivalued tables with parentheses 
and entity references with square brackets, then the biology example from the 
previous section might be described as: 



A larger example shown in Figure 1 describes part of a database for high 
throughput genetic sequencing research!! Here multivalued tables are drawn 
as nested tables and entity references are italicized. (Primary keys are in bold- 
face.) Figure 2 depicts the same schema in a conventional format. 

6 Conclusions 

The paper described an alternative method of data modeling intended for non- 
specialists, based on a simple transformation of relational schemas. 
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■I- The database was designed by Chris Bizon et al. 



H(I) 
H(B) 
H[0) 



genus, species, cname 



indname , images(imgfile, notes), biotype[B] 
btname , orgs(org[0]) 
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Asm : 



Ref : 



asm id Integer 
xtfRef 

sample_linkage Varchar(255) 
loadfilename Varchar(255) 
loaduser Varchar(255) 
load time Timestamp 
notes Varchar( 1023) 
asm ext 



auth Varchar(255) 
name Varchar(255) 



var Var 
horn Boolean 



The Asm table represents sequence assemblies 
along with variants that have been identified in 
the assemblies. An assembly in the Asm table is 
uniquely identified by a database-generated 
surrogate key asm_id, and has an associated 
reference ref (of type Ref), an identifier 
sample _linkage relating the assembly to a 
sample, the name of a file load Jilename from 
which the assembly data were loaded, the name 
of a user loadjuser that loaded the data, a 
recorded time loadjime when the assembly data 
were loaded, an associated miscellaneous notes 
field notes, multiple "external" names asm_ext, 
and multiple variants asm_var. 

— ► Each asm_ext is uniquely identified by a 
naming authority auth and a name. 

— > Each asm_var is uniquely identified by a 
canonical variant var (of type Var), and has a 
Boolean value horn equal to "true" if the variant 
is homozygous. 



ref_name Varchar(255) 
ref loc 



contig Varchar(15) 
pos Integer 
base Char(l) 
ref var 



var Var 
endpos Integer 



The Ref table represents reference sequences. A 
reference sequence in the Ref table is uniquely 
identified by a reference name refjiame, and has 
multiple locations ref Joe. 

—*■ Each ref Joe is uniquely identified by a 
chromosome name contig and a physical location 
pos, and has a reference base base and multiple 
variants ref_var at that location. 

> > Each ref_var is uniquely identified by a 
canonical variant var (of type Var), and has an 
end position endpos relative to the contig. 



Var: 



var_id Integer 

type Enum ("snp", "ins", "del") 
seq Varchar(65535) 



The Var table represents canonical variants. A 
variant in the Var table is uniquely identified by a 
database-generated surrogate key varjd, and has 
a variant type type, and a sequence seq. 



Figure 1: Sequence database, with abstract attributes and hierarchical tables 
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asm 



asm_id integer 



ref_name varchar(255) [F] 
samplejinkage varchar(255) 
loadjilename varchar(255) 
loacLuser varchar(255) 
load_time timestamp 
notes varchar(1023) 



asm_id integer [F] 
var_id integer [F] 



horn boolean 



asm_id integer [F] 
auth varchar(255) 
name varchar(255) 



var_id integer 



type enum ("snp", "ins", "del") 
seq varchar(65535) 



ref_var 



ref_name varchar(255) [F] 
contig varchar(15) [F] 
pos integer [F] 
var_id integer [F] 



endpos integer 



refjoc 



ref 


ref_name varchar(255) [F] 




ref_name varchar(255) 




contig varchar(15) 
pos integer 














base char(1) 



Figure 2: Sequence database, entity-relationship logical model (IDEF1X) 
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